Parsing Conjunctions Deterministically
Abstract
Conjunctions have always been a source of problems for natural language parsers. This paper shows how these problems may be circumvented using a rule-based, wait-and-see parsing strategy. A parser is presented which analyzes conjunction structures deterministically, and the specific rules it uses are described and illustrated. This parser appears to be faster for conjunctions than other parsers in the literature, and some comparative timings are given.

INTRODUCTION

In recent years, there has been an upsurge of interest in techniques for parsing sentences containing coordinate conjunctions (and, or and but) [1,2,3,4,5,8,9]. These techniques are intended to deal with three computational problems inherent in conjunction parsing:

1. Since virtually any pair of constituents of the same syntactic type may be conjoined, a grammar that explicitly enumerates all the possibilities seems needlessly cluttered with a large number of conjunction rules.

2. If a parser uses a top-down analysis strategy (as is common with ATN and logic grammars), it must hypothesize a structure for the second conjunct without knowledge of its actual structure. Since this structure could be any that parallels some constituent that ends at the conjunction, the parser must generate and test all such possibilities in order to find the ones that match. In practice, the combinatorial explosion of possibilities makes this slow.

3. It is possible for a conjunct to have "gaps" (ellipsed elements) which are not allowed in an unconjoined constituent of the same type. These gaps must be filled with elements from the other conjunct for a proper interpretation, as in:

   I gave Mary a nickel and Harry a dime.

The paper by Lesmo and Torasso [9] briefly reviews which techniques apply to which problems before presenting their own approach.

Two papers in the list above [1,3] present deterministic, "wait-and-see" methods for conjunction parsing.
In both, however, the discussion centers around the theory and feasibility of parsers that obey the Marcus determinism hypothesis [10] and operate with a limited-length lookahead buffer. This paper examines the other side of the coin, namely, the practical power of the wait-and-see approach compared to strictly top-down or bottom-up methods. A parser is described that analyzes conjunction structures deterministically and produces parse trees similar to those produced by Dahl & McCord's MSG system [4]. It is much faster than either MSG or Fong & Berwick's RPM device [5], and comparative timings are given. We conclude with some descriptive comparisons to other systems and a discussion of the reasons behind the performance observed.

OVERVIEW OF THE PARSER

For the sake of a name, we will call the parser NEXUS since it is the syntactic component of a larger system called NEXUS. This system is being developed to study the problem of learning technical concepts from expository text. The acronym stands for Non-Expert Understanding System.

NEXUS is a direct descendant of READER, a parser written by Ginsparg at Stanford in the late 1970s [6]. Like all wait-and-see parsers, it incorporates a stack to hold constituent structures being built, some variables that record the state of the parse, and a set of transition rules that control the parsing process. The stack structures and state variables in NEXUS are almost the same as in READER, but the rules have been rewritten to make them cleaner, more transparent, and more complete.

There are two categories of rules. Segmentation rules are responsible for finding the boundaries of constituents and creating stack structures to store these results. Recombination rules are responsible for attaching one structure to another in syntactically valid ways. Segmentation operations are separate from, and always precede, recombination operations. All the rules are encoded in Lisp; there is no separate rule interpreter.
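The division of labor between the two rule categories can be pictured with a small sketch. This is not the NEXUS code (which is written in Lisp and is not reproduced here); it is a toy illustration in Python, and every name in it (`Structure`, `ParseState`, `segment`, `recombine`, the word-class labels) is hypothetical. It also assumes the simplest reading of "segmentation precedes recombination", namely that all words are segmented before any attachment happens.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    """A stack entry: a collection of slots (hypothetical layout)."""
    kind: str                                   # "clause" or "pp" in this toy
    slots: dict = field(default_factory=dict)


@dataclass
class ParseState:
    """Stack of structures being built plus state variables."""
    stack: list = field(default_factory=list)
    variables: dict = field(default_factory=dict)


def segment(state, word, word_class):
    """Toy segmentation rule: a preposition opens a new structure,
    any other word fills a slot in the structure on top of the stack."""
    if word_class == "prep" or not state.stack:
        kind = "pp" if word_class == "prep" else "clause"
        state.stack.append(Structure(kind))
    top = state.stack[-1]
    top.slots.setdefault(word_class, []).append(word)
    state.variables["last_action"] = word_class  # record what was just done
    return state


def recombine(state):
    """Toy recombination rule: attach each structure to the one beneath it."""
    while len(state.stack) > 1:
        child = state.stack.pop()
        state.stack[-1].slots.setdefault("attached", []).append(child)
    return state.stack[0]


def parse(tagged_words):
    """Run segmentation over every word, then recombination."""
    state = ParseState()
    for word, word_class in tagged_words:
        state = segment(state, word, word_class)
    return recombine(state)


tree = parse([("John", "noun"), ("slept", "verb"),
              ("in", "prep"), ("Boston", "noun")])
print(tree.kind, sorted(tree.slots))
```

In this toy, segmentation only decides where constituents begin and which slot a word fills, while recombination alone decides what attaches to what. Real rules of either kind would test far richer conditions than a single word class.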
Segmentation rules take as input a word from the input sentence and a partial-parse of the sentence up to that word. The rules are organized into procedures such that each procedure implements those rules that apply to one syntactic word class. When a rule's conditions are met, it adds the input word to the partial-parse, in a way specified in the rule, and returns the new partial-parse as output. A partial-parse has three parts:

1. The stack: A stack (not a tree) of the data structures which encode constituents. There are two types of structures in the stack, one type representing clause nuclei (the verb group, noun phrase arguments, and adverbs of a clause), and the other representing prepositional phrases. Each structure consists of a collection of slots to be filled with constituents as the parse proceeds.

2. The message (MSG): A symbol specifying the last action performed on the stack. In general, this symbol will indicate the type of slot the last input word
Similar resources
Attacking Parsing Bottlenecks with Unlabeled Data and Relevant Factorizations
Prepositions and conjunctions are two of the largest remaining bottlenecks in parsing. Across various existing parsers, these two categories have the lowest accuracies, and mistakes made have consequences for downstream applications. Prepositions and conjunctions are often assumed to depend on lexical dependencies for correct resolution. As lexical statistics based on the training set only are ...
On Parsing Binary Dependency Structures Deterministically In Linear Time
In this paper we demonstrate that it is possible to parse dependency structures deterministically in linear time using syntactic heuristic choices. We first prove theoretically that deterministic, linear parsing of dependency structures is possible under certain conditions. We then discuss a fully implemented parser and argue that those conditions hold for at least one natural language. Empiric...
Joint Dependency Parsing and Multiword Expression Tokenization
Complex conjunctions and determiners are often considered as pretokenized units in parsing. This is not always realistic, since they can be ambiguous. We propose a model for joint dependency parsing and multiword expressions identification, in which complex function words are represented as individual tokens linked with morphological dependencies. Our graph-based parser includes standard secondo...
Learning Grammar with Explicit Annotations for Subordinating Conjunctions
Data-driven approach for parsing may suffer from data sparsity when entirely unsupervised. External knowledge has been shown to be an effective way to alleviate this problem. Subordinating conjunctions impose important constraints on Chinese syntactic structures. This paper proposes a method to develop a grammar with hierarchical category knowledge of subordinating conjunctions as explicit anno...
Croatian Dependency Treebank 2.0: New Annotation Guidelines for Improved Parsing
We present a new version of the Croatian Dependency Treebank. It constitutes a slight departure from the previously closely observed Prague Dependency Treebank syntactic layer annotation guidelines as we introduce a new subset of syntactic tags on top of the existing tagset. These new tags are used in explicit annotation of subordinate clauses via subordinate conjunctions. Introducing the new a...